HELOC

Exploratory Data Analysis with Feature Engineering and Selection

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt 
import seaborn as sns 
from IPython.display import display
import math
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.offline as offline
import cufflinks as cf
py.init_notebook_mode(connected=True)
init_notebook_mode(connected=True)
offline.init_notebook_mode()
cf.go_offline()


color = sns.color_palette()
In [2]:
heloc = pd.read_excel("heloc_dataset_v1.xlsx")
display(heloc.tail(n=5).transpose())
10454 10455 10456 10457 10458
RiskPerformance Good Bad Bad Bad Bad
ExternalRiskEstimate 73 65 74 72 66
MSinceOldestTradeOpen 131 147 129 234 28
MSinceMostRecentTradeOpen 5 39 6 12 1
AverageMInFile 57 68 64 113 17
NumSatisfactoryTrades 21 11 18 42 4
NumTrades60Ever2DerogPubRec 0 0 1 2 0
NumTrades90Ever2DerogPubRec 0 0 1 2 0
PercentTradesNeverDelq 95 92 100 96 100
MSinceMostRecentDelq 80 28 -7 35 -7
MaxDelq2PublicRecLast12M 6 6 6 6 7
MaxDelqEver 6 6 8 2 8
NumTotalTrades 21 12 18 45 5
NumTradesOpeninLast12M 5 0 1 0 2
PercentInstallTrades 19 42 33 20 60
MSinceMostRecentInqexcl7days 7 1 3 6 3
NumInqLast6M 0 1 4 0 3
NumInqLast6Mexcl7days 0 1 4 0 2
NetFractionRevolvingBurden 26 86 6 19 67
NetFractionInstallBurden -8 53 -8 -8 -8
NumRevolvingTradesWBalance 5 2 5 4 2
NumInstallTradesWBalance 2 2 -8 1 1
NumBank2NatlTradesWHighUtilization 0 1 0 0 0
PercentTradesWBalance 100 80 56 38 100

Balance of input datastet

In [3]:
# Investigate balance for both output classes
n_records = heloc.shape[0]
n_riskperf_good = heloc.loc[heloc['RiskPerformance'] == 'Good'].shape[0]
n_riskperf_bad = heloc.loc[heloc['RiskPerformance'] == 'Bad'].shape[0]
good_percent = (n_riskperf_good/n_records)*100

print("Total number of records: {:0,}".format(n_records))
print("Individuals with good RiskPerformance: {:0,}".format(n_riskperf_good))
print("Individuals with bad RiskPerformance: {:0,}".format(n_riskperf_bad))
print("Percentage of individuals with good RiskPerformance: {:0.2f}%".format(good_percent))
Total number of records: 10,459
Individuals with good RiskPerformance: 5,000
Individuals with bad RiskPerformance: 5,459
Percentage of individuals with good RiskPerformance: 47.81%
In [4]:
temp = heloc["RiskPerformance"].value_counts()
df = pd.DataFrame({'labels': temp.index,
                   'values': temp.values
                  })
print(df)
df.iplot(kind='pie',labels='labels',values='values', title='RiskPerformance')
  labels  values
0    Bad    5459
1   Good    5000

Featureset Exploration

FICO uses 5 broad categories of information, each assigned with a relative importance, to determine credit score. The features provided in the input dataset have been subjectively mapped to these categories below (relative weight in bracket) in order to develop a better intuition of the input features.

  • Payment History (35%) : Considers payment history from different types of accounts, public record and collection items and details on late or missed payments.
    • NumSatisfactoryTrades: Number of credit agreements with on-time payments
    • NumTrades60Ever2DerogPubRec: Number of trade lines on a credit bureau report that record a payment received 60 days past its due date. This feature also checks all Public Records available for the consumer, and adds to this count any items considered “Derogatory”.
    • NumTrades90Ever2DerogPubRec: Number of trade lines on a credit bureau report that record a payment received 90 days past its due date. This feature also checks all Public Records available for the consumer, and adds to this count any items considered “Derogatory”.
    • PercentTradesNeverDelq: Percent Trades Never Delinquent
    • MSinceMostRecentDelq: Months Since Most Recent Delinquency
    • MaxDelq2PublicRecLast12M: Categorical variable denoting how severe was delinquency in last 12 months, if at all and known
    • MaxDelqEver: Categorical variable denoting how severe was delinquency in the past, if at all and known
  • Amount of Debt (30%) : Considers total amount owed across all accounts, amounts owed on specific types (revolving, credit) of accounts, number of accounts with balance, credit utilization ratio on revolving accounts, remaining amount owed on instalment loans
    • NetFractionRevolvingBurden: Revolving balance divided by credit limit
    • NetFractionInstallBurden: Installment balance divided by original loan amount
    • NumBank2NatlTradesWHighUtilization: Number of credit cards carrying a balance at 75% of its limit or greater
    • PercentTradesWBalance: Percent Trades with Balance
  • Length of Credit History (15%) : Considers age of oldest account, average account age and age of specific types of accounts (credit card, auto loans, etc)
    • MSinceOldestTradeOpen: Months Since Oldest Trade Open
    • AverageMInFile: Average history length of trades in months
  • New Credit (10%) : Considers the number of new accounts, how long since new account opened, number of recent requests for credit and rate shopping for a single loan
    • MSinceMostRecentTradeOpen: Months Since Most Recent Trade Open
    • NumTradesOpeninLast12M: Number of Trades Open in Last 12 Months
    • MSinceMostRecentInqexcl7days: Months Since Most Recent Inquiry excl 7days
    • NumInqLast6M: Number of Inquiry Last 6 Months
    • NumInqLast6Mexcl7days: Number of Inq Last 6 Months excl 7days (to account for price comparison shopping)
  • Credit Mix (10%) : Considers types of credit accounts (credit cards, retail accounts, instalment loans and mortgage loans)
    • PercentInstallTrades: Percent Installment Trades
    • NumRevolvingTradesWBalance: Number Revolving Trades with Balance
    • NumInstallTradesWBalance: Number Installment Trades with Balance
  • Other Category : Remaining features which could not be mapped to one of the above FICO categories
    • ExternalRiskEstimate: Consolidated risk estimation from other credit bureaus
    • NumTotalTrades: Number of Total Trades (total number of credit accounts)
In [5]:
display(heloc.describe().transpose())
count mean std min 25% 50% 75% max
ExternalRiskEstimate 10459.0 67.425758 21.121621 -9.0 63.0 71.0 79.0 94.0
MSinceOldestTradeOpen 10459.0 184.205373 109.683816 -9.0 118.0 178.0 249.5 803.0
MSinceMostRecentTradeOpen 10459.0 8.543455 13.301745 -9.0 3.0 5.0 11.0 383.0
AverageMInFile 10459.0 73.843293 38.782803 -9.0 52.0 74.0 95.0 383.0
NumSatisfactoryTrades 10459.0 19.428052 13.004327 -9.0 12.0 19.0 27.0 79.0
NumTrades60Ever2DerogPubRec 10459.0 0.042738 2.513910 -9.0 0.0 0.0 1.0 19.0
NumTrades90Ever2DerogPubRec 10459.0 -0.142843 2.367397 -9.0 0.0 0.0 0.0 19.0
PercentTradesNeverDelq 10459.0 86.661536 25.999584 -9.0 87.0 96.0 100.0 100.0
MSinceMostRecentDelq 10459.0 6.762406 20.501250 -9.0 -7.0 -7.0 14.0 83.0
MaxDelq2PublicRecLast12M 10459.0 4.928291 3.756275 -9.0 4.0 6.0 7.0 9.0
MaxDelqEver 10459.0 5.510183 3.971182 -9.0 5.0 6.0 8.0 8.0
NumTotalTrades 10459.0 20.856965 14.580905 -9.0 12.0 20.0 29.0 104.0
NumTradesOpeninLast12M 10459.0 1.253083 3.068684 -9.0 0.0 1.0 3.0 19.0
PercentInstallTrades 10459.0 32.166460 20.128634 -9.0 20.0 31.0 44.0 100.0
MSinceMostRecentInqexcl7days 10459.0 -0.325366 6.067556 -9.0 -7.0 0.0 1.0 24.0
NumInqLast6M 10459.0 0.868152 3.179304 -9.0 0.0 1.0 2.0 66.0
NumInqLast6Mexcl7days 10459.0 0.812602 3.143698 -9.0 0.0 1.0 2.0 66.0
NetFractionRevolvingBurden 10459.0 31.629888 30.060140 -9.0 5.0 25.0 54.0 232.0
NetFractionInstallBurden 10459.0 39.158906 42.101601 -9.0 -8.0 47.0 79.0 471.0
NumRevolvingTradesWBalance 10459.0 3.185008 4.413173 -9.0 2.0 3.0 5.0 32.0
NumInstallTradesWBalance 10459.0 0.976097 4.060995 -9.0 1.0 2.0 3.0 23.0
NumBank2NatlTradesWHighUtilization 10459.0 0.018071 3.358135 -9.0 0.0 0.0 1.0 18.0
PercentTradesWBalance 10459.0 62.079166 27.711565 -9.0 47.0 67.0 82.0 100.0

Investigation of Special Values

In [6]:
print("{:30s}  {:}  {:}  {:}".format("Feature", "No Record/Invest.", "No Usable/Valid Trade/Inq", "Cond. not met" ))
for col in heloc.columns:
    no_record = heloc.loc[heloc[col] == -9].shape[0]
    no_validtrades = heloc.loc[heloc[col] == -8].shape[0]
    no_inq_delq = heloc.loc[heloc[col] == -7].shape[0]
    print("{:30s} \t {:0,} \t\t\t {:0,} \t\t\t {:0,}".format(col[:30], no_record, no_validtrades , no_inq_delq ))
Feature                         No Record/Invest.  No Usable/Valid Trade/Inq  Cond. not met
RiskPerformance                	 0 			 0 			 0
ExternalRiskEstimate           	 598 			 0 			 0
MSinceOldestTradeOpen          	 588 			 239 			 0
MSinceMostRecentTradeOpen      	 588 			 0 			 0
AverageMInFile                 	 588 			 0 			 0
NumSatisfactoryTrades          	 588 			 0 			 0
NumTrades60Ever2DerogPubRec    	 588 			 0 			 0
NumTrades90Ever2DerogPubRec    	 588 			 0 			 0
PercentTradesNeverDelq         	 588 			 0 			 0
MSinceMostRecentDelq           	 588 			 176 			 4,664
MaxDelq2PublicRecLast12M       	 588 			 0 			 0
MaxDelqEver                    	 588 			 0 			 0
NumTotalTrades                 	 588 			 0 			 0
NumTradesOpeninLast12M         	 588 			 0 			 0
PercentInstallTrades           	 588 			 0 			 0
MSinceMostRecentInqexcl7days   	 588 			 476 			 1,855
NumInqLast6M                   	 588 			 0 			 0
NumInqLast6Mexcl7days          	 588 			 0 			 0
NetFractionRevolvingBurden     	 588 			 186 			 0
NetFractionInstallBurden       	 588 			 3,419 			 0
NumRevolvingTradesWBalance     	 588 			 156 			 0
NumInstallTradesWBalance       	 588 			 861 			 0
NumBank2NatlTradesWHighUtiliza 	 588 			 583 			 0
PercentTradesWBalance          	 588 			 18 			 0

There are 3 special values in the HELOC dataset:

  • No Bureau record or No Investigation (entry value =-9): There are at least 588 such entries detected for every feature provided in the dataset (excluding the target variable). If these entries belong to the same set of customers, the entries will be dropped as they offer no usable features to train the learning algorithm.
  • No Usable/Valid trades or Inquiries (entry value =-8): The NetFractionInstallBurden feature displays a large gap in terms of usable/valid trades. Once the importance of this feature is determined in predicting RiskPerformance, this particular gap might need to be revisited.
  • Conditions not met (e.g No Inquiries, no delinquencies) (entry value =-7): The MSinceMostRecentDelq feature has 4.6K entries with this special value, reflecting no previous delinquency. This might be a very useful piece of information in determining RiskPerformance.

Applicants with no bureau record or investigation

According to the data provider, when a loan applicant’s credit bureau report was either not investigated or not found, all features obtained from the credit bureau report receive a special value of -9. The confounding of no bureau report investigated (most likely a VIP applicant) and no bureau report found (a negative trait for extending credit) is most likely responsible for the two different outcomes.

All such records will be removed from the dataset due to their confounding nature.

In [7]:
no_rec= (heloc.loc[heloc['MSinceOldestTradeOpen'] == -9])
#display(no_rec)
display(no_rec.describe().transpose())
count mean std min 25% 50% 75% max
ExternalRiskEstimate 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
MSinceOldestTradeOpen 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
MSinceMostRecentTradeOpen 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
AverageMInFile 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumSatisfactoryTrades 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumTrades60Ever2DerogPubRec 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumTrades90Ever2DerogPubRec 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
PercentTradesNeverDelq 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
MSinceMostRecentDelq 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
MaxDelq2PublicRecLast12M 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
MaxDelqEver 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumTotalTrades 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumTradesOpeninLast12M 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
PercentInstallTrades 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
MSinceMostRecentInqexcl7days 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumInqLast6M 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumInqLast6Mexcl7days 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NetFractionRevolvingBurden 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NetFractionInstallBurden 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumRevolvingTradesWBalance 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumInstallTradesWBalance 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
NumBank2NatlTradesWHighUtilization 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
PercentTradesWBalance 588.0 -9.0 0.0 -9.0 -9.0 -9.0 -9.0 -9.0
In [8]:
#Remove 588 entries with no bureau records
clean_norec_heloc = heloc.copy()
clean_norec_heloc.drop(clean_norec_heloc[clean_norec_heloc.MSinceOldestTradeOpen == -9].index, inplace=True)
In [9]:
n_unknown_DelqLast12M= (clean_norec_heloc.loc[clean_norec_heloc['MaxDelq2PublicRecLast12M'] > 7]).shape[0]
n_unknown_DelqEver_1= (clean_norec_heloc.loc[clean_norec_heloc['MaxDelqEver'] == 1]).shape[0]
n_unknown_DelqEver_7= (clean_norec_heloc.loc[clean_norec_heloc['MaxDelqEver'] == 7]).shape[0]
n_unknown_DelqEver_9= (clean_norec_heloc.loc[clean_norec_heloc['MaxDelqEver'] == 9]).shape[0]

print("Individuals with unusable delinquency data over last 12 months: {:0,}".format(n_unknown_DelqLast12M))
print("Individuals with no delinquency data: {:0,}".format(n_unknown_DelqEver_1))
print("Individuals with unknown delinquency: {:0,}".format(n_unknown_DelqEver_7))
print("Individuals with unusable delinquency data: {:0,}".format(n_unknown_DelqEver_9))
Individuals with unusable delinquency data over last 12 months: 6
Individuals with no delinquency data: 0
Individuals with unknown delinquency: 132
Individuals with unusable delinquency data: 0
In [10]:
display(clean_norec_heloc.loc[clean_norec_heloc['MaxDelq2PublicRecLast12M'] > 7].transpose())
89 2662 6413 6522 8721 9148
RiskPerformance Good Bad Good Bad Bad Bad
ExternalRiskEstimate -9 -9 93 -9 -9 80
MSinceOldestTradeOpen 17 92 425 115 175 168
MSinceMostRecentTradeOpen 17 82 63 55 159 1
AverageMInFile 17 87 179 78 167 34
NumSatisfactoryTrades 1 2 6 3 2 1
NumTrades60Ever2DerogPubRec 0 0 0 0 1 0
NumTrades90Ever2DerogPubRec 0 0 0 0 1 0
PercentTradesNeverDelq 100 100 100 100 100 100
MSinceMostRecentDelq -7 -7 -7 -7 -7 -7
MaxDelq2PublicRecLast12M 9 9 9 9 9 9
MaxDelqEver 8 8 8 8 8 8
NumTotalTrades 1 2 6 3 3 5
NumTradesOpeninLast12M 0 0 0 0 0 4
PercentInstallTrades 100 50 33 100 50 20
MSinceMostRecentInqexcl7days 0 -7 0 0 0 -7
NumInqLast6M 0 0 0 0 2 4
NumInqLast6Mexcl7days 0 0 0 0 2 4
NetFractionRevolvingBurden -8 -8 0 -8 -8 34
NetFractionInstallBurden -8 -8 -8 -8 -8 101
NumRevolvingTradesWBalance -8 -8 0 -8 -8 1
NumInstallTradesWBalance -8 -8 -8 -8 -8 1
NumBank2NatlTradesWHighUtilization -8 -8 -8 -8 -8 0
PercentTradesWBalance -8 -8 0 -8 -8 50
In [11]:
clean_nodelq_heloc = clean_norec_heloc.copy()
clean_nodelq_heloc.drop(clean_nodelq_heloc[clean_nodelq_heloc.MaxDelq2PublicRecLast12M > 7].index, inplace=True)
display(clean_nodelq_heloc.describe().transpose())
count mean std min 25% 50% 75% max
ExternalRiskEstimate 9865.0 72.008211 10.067575 -9.0 64.0 72.0 80.0 94.0
MSinceOldestTradeOpen 9865.0 195.732793 101.915892 -8.0 131.0 183.0 255.0 803.0
MSinceMostRecentTradeOpen 9865.0 9.556107 12.839190 0.0 3.0 6.0 12.0 383.0
AverageMInFile 9865.0 78.769083 34.041123 4.0 57.0 76.0 97.0 383.0
NumSatisfactoryTrades 9865.0 21.132793 11.315438 0.0 13.0 20.0 28.0 79.0
NumTrades60Ever2DerogPubRec 9865.0 0.581652 1.239084 0.0 0.0 0.0 1.0 19.0
NumTrades90Ever2DerogPubRec 9865.0 0.384896 0.993468 0.0 0.0 0.0 0.0 19.0
PercentTradesNeverDelq 9865.0 92.355297 11.774948 0.0 89.0 97.0 100.0 100.0
MSinceMostRecentDelq 9865.0 7.710289 20.731307 -8.0 -7.0 0.0 15.0 83.0
MaxDelq2PublicRecLast12M 9865.0 5.756006 1.643072 0.0 5.0 6.0 7.0 7.0
MaxDelqEver 9865.0 6.373543 1.849314 2.0 6.0 6.0 8.0 8.0
NumTotalTrades 9865.0 22.647238 12.995088 0.0 14.0 21.0 30.0 104.0
NumTradesOpeninLast12M 9865.0 1.864572 1.828047 0.0 0.0 1.0 3.0 19.0
PercentInstallTrades 9865.0 34.603953 17.932789 0.0 21.0 33.0 45.0 100.0
MSinceMostRecentInqexcl7days 9865.0 0.192904 5.853960 -8.0 0.0 0.0 1.0 24.0
NumInqLast6M 9865.0 1.456260 2.136449 0.0 0.0 1.0 2.0 66.0
NumInqLast6Mexcl7days 9865.0 1.397364 2.096378 0.0 0.0 1.0 2.0 66.0
NetFractionRevolvingBurden 9865.0 34.070654 29.204882 -8.0 8.0 28.0 56.0 232.0
NetFractionInstallBurden 9865.0 42.047035 41.607386 -8.0 -8.0 52.0 80.0 471.0
NumRevolvingTradesWBalance 9865.0 3.916371 3.348294 -8.0 2.0 3.0 5.0 32.0
NumInstallTradesWBalance 9865.0 1.575266 3.339507 -8.0 1.0 2.0 3.0 23.0
NumBank2NatlTradesWHighUtilization 9865.0 0.559655 2.604295 -8.0 0.0 1.0 1.0 18.0
PercentTradesWBalance 9865.0 66.351749 22.189028 -8.0 50.0 67.0 83.0 100.0

Feature Engineering

Feature Correlation

Based on the correlation heatmap below, the following features exhibit strong correlation:

  • NumInqLast6M and NumInqLast6Mexcl7days : NumInqLast6M will be dropped from the feature set as NumInqLast6Mexcl7days accounts for price shopping.
  • NumTrades60Ever2DerogPubRec and NumTrades90Ever2DerogPubRec : NumTrades90Ever2DerogPubRec will be dropped from the feature set as it is a subset of NumTrades60Ever2DerogPubRec
  • NumSatisfactoryTrades and NumTotalTrades: NumTotalTrades will be dropped after it has been used to convert features that are expressed as a % of number of trades (PercentTradesNeverDelq, PercentInstallTrades, PercentTradesWBalance)
In [12]:
from seaborn import heatmap
# Pretty display for notebooks
%matplotlib inline
plt.figure(figsize=(25, 25))
heatmap(clean_nodelq_heloc.iloc[:, 1:].corr(),annot= True, square= True, cmap="RdYlGn")
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dd48961a58>
In [13]:
#clean_nodelq_heloc.drop(['NumInqLast6M','NumTrades90Ever2DerogPubRec'],axis=1, inplace=True)
clean_nodelq_heloc['NumTradesNeverDelq'] =  np.rint(clean_nodelq_heloc['PercentTradesNeverDelq']* clean_nodelq_heloc['NumTotalTrades']/100)
clean_nodelq_heloc['NumInstallTrades'] =  np.rint(clean_nodelq_heloc['PercentInstallTrades']* clean_nodelq_heloc['NumTotalTrades']/100)
clean_nodelq_heloc['NumTradesWBalance'] =  np.rint(clean_nodelq_heloc['PercentTradesWBalance']* clean_nodelq_heloc['NumTotalTrades']/100)

display(clean_nodelq_heloc.head().transpose())
0 1 2 3 4
RiskPerformance Bad Bad Bad Bad Bad
ExternalRiskEstimate 55 61 67 66 81
MSinceOldestTradeOpen 144 58 66 169 333
MSinceMostRecentTradeOpen 4 15 5 1 27
AverageMInFile 84 41 24 73 132
NumSatisfactoryTrades 20 2 9 28 12
NumTrades60Ever2DerogPubRec 3 4 0 1 0
NumTrades90Ever2DerogPubRec 0 4 0 1 0
PercentTradesNeverDelq 83 100 100 93 100
MSinceMostRecentDelq 2 -7 -7 76 -7
MaxDelq2PublicRecLast12M 3 0 7 6 7
MaxDelqEver 5 8 8 6 8
NumTotalTrades 23 7 9 30 12
NumTradesOpeninLast12M 1 0 4 3 0
PercentInstallTrades 43 67 44 57 25
MSinceMostRecentInqexcl7days 0 0 0 0 0
NumInqLast6M 0 0 4 5 1
NumInqLast6Mexcl7days 0 0 4 4 1
NetFractionRevolvingBurden 33 0 53 72 51
NetFractionInstallBurden -8 -8 66 83 89
NumRevolvingTradesWBalance 8 0 4 6 3
NumInstallTradesWBalance 1 -8 2 4 1
NumBank2NatlTradesWHighUtilization 1 -8 1 3 0
PercentTradesWBalance 69 0 86 91 80
NumTradesNeverDelq 19 7 9 28 12
NumInstallTrades 10 5 4 17 3
NumTradesWBalance 16 0 8 27 10

Transforming Categorical Features

The features MaxDelq2PublicRecLast12M and MaxDelqEver will have their values grouped in the following categories:

  • Derogatory
  • Delinquent (30, 60, 90, 120+ days)
  • Never delinquent (current)
  • Invalid (none of the above categories)

These categories will be one-hot encoded to add them as distinct features within the dataset.

In [14]:
delq12M_map = {1: 'DelinquentLast12M', 
                   2: 'DelinquentLast12M', 
                   3: 'DelinquentLast12M', 
                   4: 'DelinquentLast12M', 
                   5: 'InvalidDelqLast12M', 
                   6: 'InvalidDelqLast12M', 
                   7: 'NotDelinquentLast12M', 
                   8: 'InvalidDelqLast12M', 
                   9: 'InvalidDelqLast12M',
                   0: 'DerogatoryLast12M'}
delqEver_map = {1: 'InvalidDelq', 
                   2: 'Derogatory', 
                   3: 'Delinquent', 
                   4: 'Delinquent', 
                   5: 'Delinquent', 
                   6: 'Delinquent', 
                   7: 'InvalidDelq', 
                   8: 'NeverDelinquent', 
                   9: 'InvalidDelq',
                   }



clean_nodelq_heloc['MaxDelq2PublicRecLast12MLabel'] = clean_nodelq_heloc['MaxDelq2PublicRecLast12M'].map(delq12M_map)
clean_nodelq_heloc['MaxDelqEverLabel'] = clean_nodelq_heloc['MaxDelqEver'].map(delqEver_map)
full_features = pd.get_dummies(clean_nodelq_heloc,columns=["MaxDelq2PublicRecLast12MLabel","MaxDelqEverLabel"])
encoded = list(full_features.columns)
print("{} total features after one-hot encoding. \n".format(len(encoded)))
display(full_features.head().transpose())
35 total features after one-hot encoding. 

0 1 2 3 4
RiskPerformance Bad Bad Bad Bad Bad
ExternalRiskEstimate 55 61 67 66 81
MSinceOldestTradeOpen 144 58 66 169 333
MSinceMostRecentTradeOpen 4 15 5 1 27
AverageMInFile 84 41 24 73 132
NumSatisfactoryTrades 20 2 9 28 12
NumTrades60Ever2DerogPubRec 3 4 0 1 0
NumTrades90Ever2DerogPubRec 0 4 0 1 0
PercentTradesNeverDelq 83 100 100 93 100
MSinceMostRecentDelq 2 -7 -7 76 -7
MaxDelq2PublicRecLast12M 3 0 7 6 7
MaxDelqEver 5 8 8 6 8
NumTotalTrades 23 7 9 30 12
NumTradesOpeninLast12M 1 0 4 3 0
PercentInstallTrades 43 67 44 57 25
MSinceMostRecentInqexcl7days 0 0 0 0 0
NumInqLast6M 0 0 4 5 1
NumInqLast6Mexcl7days 0 0 4 4 1
NetFractionRevolvingBurden 33 0 53 72 51
NetFractionInstallBurden -8 -8 66 83 89
NumRevolvingTradesWBalance 8 0 4 6 3
NumInstallTradesWBalance 1 -8 2 4 1
NumBank2NatlTradesWHighUtilization 1 -8 1 3 0
PercentTradesWBalance 69 0 86 91 80
NumTradesNeverDelq 19 7 9 28 12
NumInstallTrades 10 5 4 17 3
NumTradesWBalance 16 0 8 27 10
MaxDelq2PublicRecLast12MLabel_DelinquentLast12M 1 0 0 0 0
MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M 0 1 0 0 0
MaxDelq2PublicRecLast12MLabel_InvalidDelqLast12M 0 0 0 1 0
MaxDelq2PublicRecLast12MLabel_NotDelinquentLast12M 0 0 1 0 1
MaxDelqEverLabel_Delinquent 1 0 0 1 0
MaxDelqEverLabel_Derogatory 0 0 0 0 0
MaxDelqEverLabel_InvalidDelq 0 0 0 0 0
MaxDelqEverLabel_NeverDelinquent 0 1 1 0 1
In [15]:
plt.figure(figsize=(25, 25))
heatmap(full_features.iloc[:, 1:].corr(),annot= True, square= True, cmap="RdYlGn")
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dd48f34b00>

Removal of features exhibiting collinearity

The feature pairs exhibiting absolute correlation greater than 0.80 (in descending order) are listed below:

  • NumInqLast6M and NumInqLast6Mexcl7days : NumInqLast6M will be dropped from the feature set
  • NumTotalTrades and NumTradesNeverDelq : NumTotalTrades will be dropped from the feature set
  • MaxDelqEverLabel_NeverDelinquent and MaxDelq2PublicRecLast12MLabel_NotDelinquentLast12M : MaxDelq2PublicRecLast12MLabel_NotDelinquentLast12M will be dropped from the feature set
  • NumTrades60Ever2DerogPubRec and NumTrades90Ever2DerogPubRec : NumTrades90Ever2DerogPubRec will be dropped from the feature set
  • NumSatisfactoryTrades and NumTradesNeverDelq: NumTradesNeverDelq will be dropped from the feature set
  • MaxDelqEverLabel_Delinquent and MaxDelqEverLabel_NeverDelinquent : MaxDelqEverLabel_Delinquent will be dropped

In addition, the following features will be dropped as they have already been used to derive other features:

  • PercentTradesNeverDelq
  • PercentTradesWBalance
  • PercentInstallTrades
  • MaxDelq2PublicRecLast12M
  • MaxDelqEver
  • MaxDelq2PublicRecLast12MLabel_InvalidDelqLast12M
  • MaxDelqEverLabel_InvalidDelq
In [16]:
features_drop =['NumInqLast6M','NumTotalTrades','MaxDelq2PublicRecLast12MLabel_NotDelinquentLast12M',
                'NumTrades90Ever2DerogPubRec','NumTradesNeverDelq','MaxDelqEverLabel_Delinquent'
                ,'PercentTradesNeverDelq','PercentTradesWBalance','PercentInstallTrades',
                'MaxDelq2PublicRecLast12M','MaxDelqEver','MaxDelq2PublicRecLast12MLabel_InvalidDelqLast12M',
               'MaxDelqEverLabel_InvalidDelq']

reduced_features = full_features.copy()
reduced_features.drop(features_drop,axis=1, inplace=True)
num_features = list(reduced_features.columns)
print("{} Number of features after reduction \n".format(len(num_features)))

display(reduced_features.head().transpose())
22 Number of features after reduction 

0 1 2 3 4
RiskPerformance Bad Bad Bad Bad Bad
ExternalRiskEstimate 55 61 67 66 81
MSinceOldestTradeOpen 144 58 66 169 333
MSinceMostRecentTradeOpen 4 15 5 1 27
AverageMInFile 84 41 24 73 132
NumSatisfactoryTrades 20 2 9 28 12
NumTrades60Ever2DerogPubRec 3 4 0 1 0
MSinceMostRecentDelq 2 -7 -7 76 -7
NumTradesOpeninLast12M 1 0 4 3 0
MSinceMostRecentInqexcl7days 0 0 0 0 0
NumInqLast6Mexcl7days 0 0 4 4 1
NetFractionRevolvingBurden 33 0 53 72 51
NetFractionInstallBurden -8 -8 66 83 89
NumRevolvingTradesWBalance 8 0 4 6 3
NumInstallTradesWBalance 1 -8 2 4 1
NumBank2NatlTradesWHighUtilization 1 -8 1 3 0
NumInstallTrades 10 5 4 17 3
NumTradesWBalance 16 0 8 27 10
MaxDelq2PublicRecLast12MLabel_DelinquentLast12M 1 0 0 0 0
MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M 0 1 0 0 0
MaxDelqEverLabel_Derogatory 0 0 0 0 0
MaxDelqEverLabel_NeverDelinquent 0 1 1 0 1
In [17]:
plt.figure(figsize=(25, 25))
heatmap(reduced_features.iloc[:, 1:].corr(),annot= True, square= True, cmap="RdYlGn", linewidths=.05)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dd4bef2470>

Feature Visualisation

In [18]:
#Produce KDEplot for a specific feature - visualise distribution of continuous numeric feature
def kdeplt(feature, rnd =-1):
    xlow= round(reduced_features[feature].min(),-1)
    xhigh =round(reduced_features[feature].max(),rnd)
    print('Low boundary : ', xlow,'High boundary : ', xhigh)
    plt.figure(figsize=(7,4))
    ax = sns.kdeplot(reduced_features[feature][reduced_features['RiskPerformance'] == 'Good'], color="darkturquoise", shade=True)
    sns.kdeplot(reduced_features[feature][reduced_features['RiskPerformance'] == 'Bad'], color="lightcoral", shade=True)
    plt.legend(['Good', 'Bad'])
    plt.title('Loan types by '+feature)
    ax.set(xlabel=feature)
    plt.xlim(xlow,xhigh)
    plt.show()
    
#Multi-bar Plot - discrete categorical attributes
def barplot(feature):
    plt.figure(figsize=(4,4))
    cp = sns.countplot(x=feature, hue="RiskPerformance", data=reduced_features)

    
# Using multiple Histograms - visualising numeric and categorical attributes together
def multhist(feature):
    fig = plt.figure(figsize = (6, 4))
    title = fig.suptitle('Loan types by '+feature)
    #fig.subplots_adjust(top=0.85, wspace=0.3)
    ax = fig.add_subplot(1,1, 1)
    ax.set_xlabel("MSinceOldestTradeOpen")
    ax.set_ylabel("Count") 

    g = sns.FacetGrid(reduced_features, hue='RiskPerformance')
    g.map(sns.distplot, feature, kde=False, bins=20, ax=ax)
    ax.legend(title='RiskPerformance')
    plt.close(2)

ExternalRiskEstimate

In [19]:
reduced_features['ExternalRiskEstimate'].describe()
Out[19]:
count    9865.000000
mean       72.008211
std        10.067575
min        -9.000000
25%        64.000000
50%        72.000000
75%        80.000000
max        94.000000
Name: ExternalRiskEstimate, dtype: float64
In [20]:
kdeplt('ExternalRiskEstimate',-2)
Low boundary :  -10 High boundary :  100
C:\Users\spkbh\newAnaconda\Anaconda3\envs\capstone\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

MSinceOldestTradeOpen

In [21]:
reduced_features['MSinceOldestTradeOpen'].describe()
Out[21]:
count    9865.000000
mean      195.732793
std       101.915892
min        -8.000000
25%       131.000000
50%       183.000000
75%       255.000000
max       803.000000
Name: MSinceOldestTradeOpen, dtype: float64
In [22]:
kdeplt('MSinceOldestTradeOpen',-3)
Low boundary :  -10 High boundary :  1000
C:\Users\spkbh\newAnaconda\Anaconda3\envs\capstone\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

MSinceMostRecentTradeOpen

In [23]:
reduced_features['MSinceMostRecentTradeOpen'].describe()
Out[23]:
count    9865.000000
mean        9.556107
std        12.839190
min         0.000000
25%         3.000000
50%         6.000000
75%        12.000000
max       383.000000
Name: MSinceMostRecentTradeOpen, dtype: float64
In [24]:
multhist('MSinceMostRecentTradeOpen')

AverageMInFile

In [25]:
reduced_features['AverageMInFile'].describe()
Out[25]:
count    9865.000000
mean       78.769083
std        34.041123
min         4.000000
25%        57.000000
50%        76.000000
75%        97.000000
max       383.000000
Name: AverageMInFile, dtype: float64
In [26]:
kdeplt('AverageMInFile')
Low boundary :  0 High boundary :  380
C:\Users\spkbh\newAnaconda\Anaconda3\envs\capstone\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

NumSatisfactoryTrades

In [27]:
reduced_features['NumSatisfactoryTrades'].describe()
Out[27]:
count    9865.000000
mean       21.132793
std        11.315438
min         0.000000
25%        13.000000
50%        20.000000
75%        28.000000
max        79.000000
Name: NumSatisfactoryTrades, dtype: float64
In [28]:
kdeplt('NumSatisfactoryTrades')
Low boundary :  0 High boundary :  80
C:\Users\spkbh\newAnaconda\Anaconda3\envs\capstone\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

NumTrades60Ever2DerogPubRec

In [29]:
reduced_features['NumTrades60Ever2DerogPubRec'].describe()
Out[29]:
count    9865.000000
mean        0.581652
std         1.239084
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        19.000000
Name: NumTrades60Ever2DerogPubRec, dtype: float64
In [30]:
multhist('NumTrades60Ever2DerogPubRec')

MSinceMostRecentDelq

In [31]:
reduced_features['MSinceMostRecentDelq'].describe()
Out[31]:
count    9865.000000
mean        7.710289
std        20.731307
min        -8.000000
25%        -7.000000
50%         0.000000
75%        15.000000
max        83.000000
Name: MSinceMostRecentDelq, dtype: float64
In [32]:
multhist('MSinceMostRecentDelq')

NumTradesOpeninLast12M

In [33]:
reduced_features['NumTradesOpeninLast12M'].describe()
Out[33]:
count    9865.000000
mean        1.864572
std         1.828047
min         0.000000
25%         0.000000
50%         1.000000
75%         3.000000
max        19.000000
Name: NumTradesOpeninLast12M, dtype: float64
In [34]:
kdeplt('NumTradesOpeninLast12M')
Low boundary :  0 High boundary :  20
C:\Users\spkbh\newAnaconda\Anaconda3\envs\capstone\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

MSinceMostRecentInqexcl7days

In [35]:
reduced_features['MSinceMostRecentInqexcl7days'].describe()
Out[35]:
count    9865.000000
mean        0.192904
std         5.853960
min        -8.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        24.000000
Name: MSinceMostRecentInqexcl7days, dtype: float64
In [36]:
multhist('MSinceMostRecentInqexcl7days')

NumInqLast6Mexcl7days

In [37]:
reduced_features['NumInqLast6Mexcl7days'].describe()
Out[37]:
count    9865.000000
mean        1.397364
std         2.096378
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max        66.000000
Name: NumInqLast6Mexcl7days, dtype: float64
In [38]:
multhist('NumInqLast6Mexcl7days')

NetFractionRevolvingBurden

In [39]:
reduced_features['NetFractionRevolvingBurden'].describe()
Out[39]:
count    9865.000000
mean       34.070654
std        29.204882
min        -8.000000
25%         8.000000
50%        28.000000
75%        56.000000
max       232.000000
Name: NetFractionRevolvingBurden, dtype: float64
In [40]:
kdeplt('NetFractionRevolvingBurden')
Low boundary :  -10 High boundary :  230
C:\Users\spkbh\newAnaconda\Anaconda3\envs\capstone\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

NetFractionInstallBurden

In [41]:
reduced_features['NetFractionInstallBurden'].describe()
Out[41]:
count    9865.000000
mean       42.047035
std        41.607386
min        -8.000000
25%        -8.000000
50%        52.000000
75%        80.000000
max       471.000000
Name: NetFractionInstallBurden, dtype: float64
In [42]:
kdeplt('NetFractionInstallBurden')
Low boundary :  -10 High boundary :  470
C:\Users\spkbh\newAnaconda\Anaconda3\envs\capstone\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

NumRevolvingTradesWBalance

In [43]:
reduced_features['NumRevolvingTradesWBalance'].describe()
Out[43]:
count    9865.000000
mean        3.916371
std         3.348294
min        -8.000000
25%         2.000000
50%         3.000000
75%         5.000000
max        32.000000
Name: NumRevolvingTradesWBalance, dtype: float64
In [44]:
kdeplt('NumRevolvingTradesWBalance')
Low boundary :  -10 High boundary :  30
C:\Users\spkbh\newAnaconda\Anaconda3\envs\capstone\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

NumInstallTradesWBalance

In [45]:
reduced_features['NumInstallTradesWBalance'].describe()
Out[45]:
count    9865.000000
mean        1.575266
std         3.339507
min        -8.000000
25%         1.000000
50%         2.000000
75%         3.000000
max        23.000000
Name: NumInstallTradesWBalance, dtype: float64
In [46]:
multhist('NumInstallTradesWBalance')

NumBank2NatlTradesWHighUtilization

In [47]:
reduced_features['NumBank2NatlTradesWHighUtilization'].describe()
Out[47]:
count    9865.000000
mean        0.559655
std         2.604295
min        -8.000000
25%         0.000000
50%         1.000000
75%         1.000000
max        18.000000
Name: NumBank2NatlTradesWHighUtilization, dtype: float64
In [48]:
multhist('NumBank2NatlTradesWHighUtilization')

NumInstallTrades

In [49]:
reduced_features['NumInstallTrades'].describe()
Out[49]:
count    9865.000000
mean        7.644906
std         5.934926
min         0.000000
25%         4.000000
50%         6.000000
75%        10.000000
max        62.000000
Name: NumInstallTrades, dtype: float64
In [50]:
multhist('NumInstallTrades')

NumTradesWBalance

In [51]:
reduced_features['NumTradesWBalance'].describe()
Out[51]:
count    9865.000000
mean       14.532793
std         9.277824
min        -1.000000
25%         8.000000
50%        13.000000
75%        19.000000
max        82.000000
Name: NumTradesWBalance, dtype: float64
In [52]:
multhist('NumTradesWBalance')

MaxDelq2PublicRecLast12MLabel_DelinquentLast12M

In [53]:
reduced_features['MaxDelq2PublicRecLast12MLabel_DelinquentLast12M'].value_counts()
Out[53]:
0.0    7922
1.0    1943
Name: MaxDelq2PublicRecLast12MLabel_DelinquentLast12M, dtype: int64
In [54]:
barplot('MaxDelq2PublicRecLast12MLabel_DelinquentLast12M')

MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M

In [55]:
reduced_features['MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M'].value_counts()
Out[55]:
0.0    9529
1.0     336
Name: MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M, dtype: int64
In [56]:
barplot('MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M')

MaxDelqEverLabel_Derogatory

In [57]:
reduced_features['MaxDelqEverLabel_Derogatory'].value_counts()
Out[57]:
0.0    9126
1.0     739
Name: MaxDelqEverLabel_Derogatory, dtype: int64
In [58]:
barplot('MaxDelqEverLabel_Derogatory')

MaxDelqEverLabel_NeverDelinquent

In [59]:
reduced_features['MaxDelqEverLabel_NeverDelinquent'].value_counts()
Out[59]:
0.0    5339
1.0    4526
Name: MaxDelqEverLabel_NeverDelinquent, dtype: int64
In [60]:
barplot('MaxDelqEverLabel_NeverDelinquent')

Feature Selection

Principal Component Analysis

The first 2 principal components explain almost 85% of the variance in the data. The key features of the first three components are listed below:

  • First Component (72.5% variance): The two most prominent features are MSinceOldestTradeOpen and AverageMInFile. This component captures the length of the credit history of the client.
  • Second Component (12.1% variance): NetFractionInstallBurden and NetFractionRevolvingBurden are the main contributors to this component together with a negative contribution from AverageMInFile. This is intuitive as the debt burden increases monotically with RiskPerformance while AverageMInFile decreases monotically with RiskPerformance. This component captures the amount of debt of the client accounting for the length of his/her credit history.
In [63]:
from sklearn.decomposition import PCA
import visuals as vs
pca = PCA(n_components=3)
pca.fit(reduced_features.iloc[:, 1:])

# Generate PCA results plot
pca_results = vs.pca_results(reduced_features.iloc[:, 1:], pca,24,20)

Recursive Feature Selection

The maximum score of 71.6% is obtained when 18 features are used. From a dimensionality reduction perspective, 13 features provide a reasonable tradeoff between score (71.5%) and number of features. The 8 selected features can be grouped into 4 categories:

  • Payment History : NumSatisfactoryTrades, NumTrades60Ever2DerogPubRec, MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M and MaxDelqEverLabel_NeverDelinquent all relate to the previous payment history of the borrower.
  • New Credit : MSinceMostRecentInqexcl7days and NumInqLast6Mexcl7days reflect how active the borrower has been in seeking new loans.
  • Amount of Debt : NumRevolvingTradesWBalance is the number of active credit accounts held by the borrower.
  • ExternalRiskEstimate : This is the least transparent of all the features since it is the credit score assigned to the borrower by other credit bureaus based on their assessment of the creditworthiness of the borrower.

When ExternalRiskEstimate is dropped from the list of features and the recursive feature selection is rerun, the following 13 features result in a score of 71.5%:

  • Payment History : NumSatisfactoryTrades, NumTrades60Ever2DerogPubRec, MaxDelq2PublicRecLast12MLabel_DelinquentLast12M, MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M, MaxDelqEverLabel_Derogatory and MaxDelqEverLabel_NeverDelinquent all relate to the payment history of the borrower.
  • New Credit : NumTradesOpeninLast12M, MSinceMostRecentInqexcl7days and NumInqLast6Mexcl7days reflect the recent credit activity of the borrower.
  • Amount of Debt : NetFractionRevolvingBurden, NumRevolvingTradesWBalance and NumBank2NatlTradesWHighUtilization reflect to the amount of borrowing done relative to credit limits.
  • Length of credit history : AverageMInFile is the average duration of the credit accounts held by the borrower.
In [64]:
reduced_features = reduced_features.reset_index(drop=True)
riskperf_raw = reduced_features['RiskPerformance']
features = reduced_features.drop('RiskPerformance' , axis = 1)
dic={'Good':0, 'Bad':1}
riskperf = riskperf_raw.map(dic)
display(riskperf.head(n=5))
0    1
1    1
2    1
3    1
4    1
Name: RiskPerformance, dtype: int64
In [65]:
from gmean_score import Gscore
def scre(y_true, y_pred):
    scre = Gscore(y_true, y_pred)
    return scre.g_mean()   
In [66]:
import gmean_score as gm
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

features1 = features.copy()
feat_drop =['ExternalRiskEstimate']
features1.drop(feat_drop,axis=1, inplace=True)

#List to store computed scores
scr=[]
model = LogisticRegression()

# RFE model for different number of features 
for i in range(1,23):
    print("Number of features: ",i)
    rfe = RFE(model, i)
    rfe = rfe.fit(features1, riskperf)
    # summarize the selection of the attributes
    print('Selected features: %s' % list(features1.columns[rfe.support_]))
    print('Score: %f' % rfe.score(features1, riskperf))
    scr.append(scre(riskperf, rfe.predict(features1)))
    print('G-mean: %f' % scr[i-1])
    

# Plot number of features VS. G_mean scores
plt.figure(figsize=(10,6))
plt.xticks(np.arange(0, 25, 1))
plt.xlabel("Number of features selected")
plt.ylabel("G_Mean score")
plt.plot(range(1, len(scr) + 1), scr)
plt.show()
Number of features:  1
Selected features: ['MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M']
Score: 0.520223
G-mean: 0.000000
Number of features:  2
Selected features: ['MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M']
Score: 0.583477
G-mean: 0.528225
Number of features:  3
Selected features: ['MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.615611
G-mean: 0.612499
Number of features:  4
Selected features: ['NumTrades60Ever2DerogPubRec', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.616624
G-mean: 0.613341
Number of features:  5
Selected features: ['NumTrades60Ever2DerogPubRec', 'NumInqLast6Mexcl7days', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.636391
G-mean: 0.636853
Number of features:  6
Selected features: ['NumTrades60Ever2DerogPubRec', 'NumInqLast6Mexcl7days', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.634567
G-mean: 0.634775
Number of features:  7
Selected features: ['NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'NumInqLast6Mexcl7days', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.635682
G-mean: 0.636202
Number of features:  8
Selected features: ['NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'NumInqLast6Mexcl7days', 'NumRevolvingTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.639838
G-mean: 0.640362
Number of features:  9
Selected features: ['NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'NumInqLast6Mexcl7days', 'NumRevolvingTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.681196
G-mean: 0.680652
Number of features:  10
Selected features: ['NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NumRevolvingTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.684136
G-mean: 0.683388
Number of features:  11
Selected features: ['NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.684237
G-mean: 0.683458
Number of features:  12
Selected features: ['NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.705930
G-mean: 0.705422
Number of features:  13
Selected features: ['AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.716371
G-mean: 0.715323
Number of features:  14
Selected features: ['MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.715459
G-mean: 0.714549
Number of features:  15
Selected features: ['MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.716675
G-mean: 0.715729
Number of features:  16
Selected features: ['MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'MSinceMostRecentDelq', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.717385
G-mean: 0.716500
Number of features:  17
Selected features: ['MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'MSinceMostRecentDelq', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'NumTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.716878
G-mean: 0.716075
Number of features:  18
Selected features: ['MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'MSinceMostRecentDelq', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NumRevolvingTradesWBalance', 'NumInstallTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'NumTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.717081
G-mean: 0.716227
Number of features:  19
Selected features: ['MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'MSinceMostRecentDelq', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NetFractionInstallBurden', 'NumRevolvingTradesWBalance', 'NumInstallTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'NumTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.717587
G-mean: 0.716780
Number of features:  20
Selected features: ['MSinceOldestTradeOpen', 'MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'MSinceMostRecentDelq', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NetFractionInstallBurden', 'NumRevolvingTradesWBalance', 'NumInstallTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'NumTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.717081
G-mean: 0.716188
Number of features:  21
Selected features: ['MSinceOldestTradeOpen', 'MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'MSinceMostRecentDelq', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NetFractionInstallBurden', 'NumRevolvingTradesWBalance', 'NumInstallTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'NumTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.717081
G-mean: 0.716188
Number of features:  22
Selected features: ['MSinceOldestTradeOpen', 'MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'MSinceMostRecentDelq', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NetFractionInstallBurden', 'NumRevolvingTradesWBalance', 'NumInstallTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'NumTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']
Score: 0.717081
G-mean: 0.716188

Recursive Feature Selection with Cross Validation

Even with cross-validation of the dataset, the results are very similar to the recursive feature elimination (without cv). The highest score is obtained when using 18 features.

In [67]:
from sklearn.feature_selection import RFECV
from sklearn.metrics import make_scorer

scoring_fnc= make_scorer(scre)
rfecv = RFECV(estimator=LogisticRegression(), step=1, cv=10, scoring=scoring_fnc)
rfecv.fit(features1, riskperf)

print("Optimal number of features: %d" % rfecv.n_features_)
print('Selected features: %s' % list(features1.columns[rfecv.support_]))

# Plot number of features VS. cross-validation scores
plt.figure(figsize=(10,6))
plt.xticks(np.arange(0, 25, 1))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation Gmean score")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
Optimal number of features: 17
Selected features: ['MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades', 'NumTrades60Ever2DerogPubRec', 'MSinceMostRecentDelq', 'NumTradesOpeninLast12M', 'MSinceMostRecentInqexcl7days', 'NumInqLast6Mexcl7days', 'NetFractionRevolvingBurden', 'NumRevolvingTradesWBalance', 'NumBank2NatlTradesWHighUtilization', 'NumInstallTrades', 'NumTradesWBalance', 'MaxDelq2PublicRecLast12MLabel_DelinquentLast12M', 'MaxDelq2PublicRecLast12MLabel_DerogatoryLast12M', 'MaxDelqEverLabel_Derogatory', 'MaxDelqEverLabel_NeverDelinquent']

Tree-Based Feature Selection

Using a decision tree classifer, an almost perfect G-mean score (99.5%) is achieved for a tree depth of 25 levels, both with and without the feature ExternalRiskEstimate. While the large tree depth is not ideal for interpretability, it was still a useful exercise to identify the most important features resulting in the very high predictive power. The most important features, excluding ExternalRiskEstimate due to its lack of transparency, are grouped in the following categories:

  • Amount of Debt : NetFractionRevolvingBurden and NumTradesWBalance reflect the various loans taken by the borrow.
  • Length of Credit History : MSinceOldestTradeOpen and AverageMInFile reflect the length of the borrower's credit history.
  • Payment History : NumSatisfactoryTrades, NumTrades60Ever2DerogPubRec and MSinceMostRecentDelq reflect the payment history of the borrower.
  • New Credit : MSinceMostRecentInqexcl7days is the number of months since the borrower last made an inquiry for a new loan.
  • Credit Mix : NumInstallTrades is the number of instalment loans held by the borrower.
In [68]:
from sklearn.tree import DecisionTreeClassifier

for i in range(1,26,4):
    print('\nTree depth: ', i)
    classifier = DecisionTreeClassifier(max_depth=i)
    classifier.fit(features, riskperf)
    print('G-mean score is: ', scre(riskperf, classifier.predict(features)))
    print('Feature importance:')
    for name, importance in zip(features.columns, classifier.feature_importances_):
        if importance>0.05:
            print(name, importance)
Tree depth:  1
G-mean score is:  0.7043989600444821
Feature importance:
ExternalRiskEstimate 1.0

Tree depth:  5
G-mean score is:  0.7294862995319136
Feature importance:
ExternalRiskEstimate 0.791538609605
MSinceMostRecentInqexcl7days 0.0943283884603

Tree depth:  9
G-mean score is:  0.7966098102334421
Feature importance:
ExternalRiskEstimate 0.522127427453
AverageMInFile 0.0686633416599
NumSatisfactoryTrades 0.060656670547
MSinceMostRecentInqexcl7days 0.0751545188118

Tree depth:  13
G-mean score is:  0.8877800918795992
Feature importance:
ExternalRiskEstimate 0.343830806424
MSinceOldestTradeOpen 0.058224401301
AverageMInFile 0.0828729911137
NumSatisfactoryTrades 0.0620099517636
MSinceMostRecentInqexcl7days 0.0588265463491
NetFractionRevolvingBurden 0.0732094329742

Tree depth:  17
G-mean score is:  0.9600826759791377
Feature importance:
ExternalRiskEstimate 0.278536454418
MSinceOldestTradeOpen 0.0726377013628
AverageMInFile 0.0876917410361
NumSatisfactoryTrades 0.0622038204075
MSinceMostRecentInqexcl7days 0.0539443520953
NetFractionRevolvingBurden 0.0742385803318
NumTradesWBalance 0.0551544494146

Tree depth:  21
G-mean score is:  0.9888144814449895
Feature importance:
ExternalRiskEstimate 0.258847429134
MSinceOldestTradeOpen 0.0821142549944
AverageMInFile 0.0828992180573
NumSatisfactoryTrades 0.0602446030714
NetFractionRevolvingBurden 0.0776347282999
NetFractionInstallBurden 0.0514660133117
NumTradesWBalance 0.0519927415221

Tree depth:  25
G-mean score is:  0.9974982659016383
Feature importance:
ExternalRiskEstimate 0.258540253503
MSinceOldestTradeOpen 0.0765983351432
AverageMInFile 0.0843620538228
NumSatisfactoryTrades 0.064099424518
NetFractionRevolvingBurden 0.0752376354294
NumTradesWBalance 0.0556409213551

ExternalRiskEstimate is the most important feature for the varying tree depths. However, since it is not very transparent how this feature is calculated by other credit bureaus, the decision tree classifier will be rerun excluding this feature to assess the impact on the Gmean score.

In [70]:
for i in range(1,26,4):
    print('\nTree depth: ', i)
    classifier1 = DecisionTreeClassifier(max_depth=i)
    classifier1.fit(features1, riskperf)
    print('G-mean score is: ', scre(riskperf, classifier1.predict(features1)))
    print('Feature importance:')
    for name, importance in zip(features1.columns, classifier1.feature_importances_):
        if importance>0.05:
            print(name, importance)
Tree depth:  1
G-mean score is:  0.6430585718897202
Feature importance:
NetFractionRevolvingBurden 1.0

Tree depth:  5
G-mean score is:  0.7193717316560352
Feature importance:
AverageMInFile 0.167393512462
NumTrades60Ever2DerogPubRec 0.154644178497
MSinceMostRecentInqexcl7days 0.146262123768
NetFractionRevolvingBurden 0.420093567768
MaxDelq2PublicRecLast12MLabel_DelinquentLast12M 0.0788743999497

Tree depth:  9
G-mean score is:  0.781425131677077
Feature importance:
AverageMInFile 0.142541698715
NumTrades60Ever2DerogPubRec 0.101909842318
MSinceMostRecentInqexcl7days 0.115933533741
NetFractionRevolvingBurden 0.301927110518
MaxDelq2PublicRecLast12MLabel_DelinquentLast12M 0.0519019909939

Tree depth:  13
G-mean score is:  0.8752879649770393
Feature importance:
MSinceOldestTradeOpen 0.0642405773354
AverageMInFile 0.124759645984
NumSatisfactoryTrades 0.0583637089612
NumTrades60Ever2DerogPubRec 0.0665621329555
MSinceMostRecentDelq 0.0582880713819
MSinceMostRecentInqexcl7days 0.0792942247731
NetFractionRevolvingBurden 0.218172415869

Tree depth:  17
G-mean score is:  0.9466754828347227
Feature importance:
MSinceOldestTradeOpen 0.0748858697627
AverageMInFile 0.118101465077
NumSatisfactoryTrades 0.0692009793044
NumTrades60Ever2DerogPubRec 0.0559461211976
MSinceMostRecentDelq 0.0530489631524
MSinceMostRecentInqexcl7days 0.0653312189737
NetFractionRevolvingBurden 0.188540951768
NumTradesWBalance 0.0522698466175

Tree depth:  21
G-mean score is:  0.9852660720444051
Feature importance:
MSinceOldestTradeOpen 0.0842085713335
AverageMInFile 0.116868412812
NumSatisfactoryTrades 0.072245605074
MSinceMostRecentDelq 0.0548103356538
MSinceMostRecentInqexcl7days 0.0614446154022
NetFractionRevolvingBurden 0.17499582563
NumTradesWBalance 0.0577007896471

Tree depth:  25
G-mean score is:  0.9985135777603634
Feature importance:
MSinceOldestTradeOpen 0.0853846775186
AverageMInFile 0.119818313901
NumSatisfactoryTrades 0.0722391299199
MSinceMostRecentDelq 0.0526949587456
MSinceMostRecentInqexcl7days 0.0614139239378
NetFractionRevolvingBurden 0.168479347416
NumTradesWBalance 0.0575292272435

Final Feature Selection

Using PCA, RFE and tree feature selection, the following 9 features consistently came up as important and 8 of them (ExternalRiskEstimate) were included in the final list of features for model training.

  • Length of Credit History :
    • MSinceOldestTradeOpen
    • AverageMInFile
  • Amount of debt :
    • NetFractionRevolvingBurden
    • NetFractionInstallBurden - No imputation will be done for the 3415 records where there is no data on the burden.
    • NumBank2NatlTradesWHighUtilization
  • Payment history :
    • NumSatisfactoryTrades
    • NumTrades60Ever2DerogPubRec
  • New Credit :
    • MSinceMostRecentInqexcl7days
  • ExternalRiskEstimate
In [71]:
final_features =['RiskPerformance','MSinceOldestTradeOpen','AverageMInFile','NetFractionRevolvingBurden','NetFractionInstallBurden','NumBank2NatlTradesWHighUtilization','NumSatisfactoryTrades','NumTrades60Ever2DerogPubRec','MSinceMostRecentInqexcl7days']
heloc_final = reduced_features[final_features]
display(heloc_final.head())
RiskPerformance MSinceOldestTradeOpen AverageMInFile NetFractionRevolvingBurden NetFractionInstallBurden NumBank2NatlTradesWHighUtilization NumSatisfactoryTrades NumTrades60Ever2DerogPubRec MSinceMostRecentInqexcl7days
0 Bad 144 84 33 -8 1 20 3 0
1 Bad 58 41 0 -8 -8 2 4 0
2 Bad 66 24 53 66 1 9 0 0
3 Bad 169 73 72 83 3 28 1 0
4 Bad 333 132 51 89 0 12 0 0
In [72]:
heloc_final.to_excel('heloc_featureset_final.xlsx',index= False)